Results 1 - 13 of 13
1.
bioRxiv ; 2024 Jan 21.
Article in English | MEDLINE | ID: mdl-38293135

ABSTRACT

Dimensionality reduction-based data visualization is pivotal in comprehending complex biological data. The most common methods, such as PHATE, t-SNE, and UMAP, are unsupervised and therefore reflect the dominant structure in the data, which may be independent of expert-provided labels. Here we introduce a supervised data visualization method called RF-PHATE, which integrates expert knowledge for further exploration of the data. RF-PHATE leverages random forests to capture intricate feature-label relationships. Extracting information from the forest, RF-PHATE generates low-dimensional visualizations that highlight relevant data relationships while disregarding extraneous features. This approach scales to large datasets and applies to classification and regression. We illustrate RF-PHATE's prowess through three case studies. In a multiple sclerosis study using longitudinal clinical and imaging data, RF-PHATE unveils a sub-group of patients with non-benign relapsing-remitting multiple sclerosis, demonstrating its aptitude for time-series data. In the context of Raman spectral data, RF-PHATE effectively showcases the impact of antioxidants on diesel exhaust-exposed lung cells, highlighting its proficiency in noisy environments. Furthermore, RF-PHATE aligns established geometric structures with COVID-19 patient outcomes, enriching interpretability in a hierarchical manner. RF-PHATE bridges expert insights and visualizations, promising knowledge generation. Its adaptability, scalability, and noise tolerance underscore its potential for widespread adoption.

2.
IEEE Trans Pattern Anal Mach Intell ; 45(9): 10947-10959, 2023 Sep.
Article in English | MEDLINE | ID: mdl-37015125

ABSTRACT

Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest and measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
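
As a point of reference for the "existing definitions" mentioned above, the following is a minimal sketch of the traditional random forest proximity: the fraction of trees in which two samples fall into the same leaf. It is not the RF-GAP definition, whose formula is not given in this abstract. The function name `proximity_from_leaves` and the toy leaf matrix are hypothetical; with scikit-learn, a matrix of this shape can be obtained from a fitted forest via `forest.apply(X)`.

```python
import numpy as np


def proximity_from_leaves(leaves):
    """Traditional RF proximity from a (n_samples, n_trees) matrix of leaf indices.

    Entry (i, j) is the fraction of trees in which samples i and j share a leaf.
    NOTE: illustrative helper, not the RF-GAP proximity from the paper.
    """
    leaves = np.asarray(leaves)
    n_samples, n_trees = leaves.shape
    prox = np.zeros((n_samples, n_samples))
    for t in range(n_trees):
        column = leaves[:, t]
        # True where samples i and j land in the same leaf of tree t.
        prox += column[:, None] == column[None, :]
    return prox / n_trees


# Toy leaf assignments for 3 samples in a 3-tree forest (made-up values).
toy_leaves = np.array([
    [0, 1, 2],
    [0, 1, 3],
    [4, 1, 3],
])
toy_prox = proximity_from_leaves(toy_leaves)
```

In this toy case, samples 0 and 1 share a leaf in two of three trees, while samples 0 and 2 share a leaf in only one, so the first pair is treated as more similar with respect to the supervised task.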

3.
IEEE Trans Pattern Anal Mach Intell ; 45(6): 7381-7394, 2023 Jun.
Article in English | MEDLINE | ID: mdl-36374884

ABSTRACT

A fundamental task in data exploration is to extract low-dimensional representations that capture intrinsic geometry in data, especially for faithfully visualizing data in two or three dimensions. Common approaches use kernel methods for manifold learning. However, these methods typically only provide an embedding of the input data and cannot extend naturally to new data points. Autoencoders have also become popular for representation learning. While they naturally compute feature extractors that are extendable to new data and invertible (i.e., reconstructing original features from latent representation), they often fail at representing the intrinsic data geometry compared to kernel-based manifold learning. We present a new method for integrating both approaches by incorporating a geometric regularization term in the bottleneck of the autoencoder. This regularization encourages the learned latent representation to follow the intrinsic data geometry, similar to manifold learning algorithms, while still enabling faithful extension to new data and preserving invertibility. We compare our approach to autoencoder models for manifold learning to provide qualitative and quantitative evidence of our advantages in preserving intrinsic structure, out-of-sample extension, and reconstruction. Our method is easily implemented for big-data applications, whereas other methods are limited in this regard.

4.
Nature ; 591(7848): 99-104, 2021 Mar.
Article in English | MEDLINE | ID: mdl-33627875

ABSTRACT

Neuropil is a fundamental form of tissue organization within the brain [1], in which densely packed neurons synaptically interconnect into precise circuit architecture [2,3]. However, the structural and developmental principles that govern this nanoscale precision remain largely unknown [4,5]. Here we use an iterative data coarse-graining algorithm termed 'diffusion condensation' [6] to identify nested circuit structures within the Caenorhabditis elegans neuropil, which is known as the nerve ring. We show that the nerve ring neuropil is largely organized into four strata that are composed of related behavioural circuits. The stratified architecture of the neuropil is a geometrical representation of the functional segregation of sensory information and motor outputs, with specific sensory organs and muscle quadrants mapping onto particular neuropil strata. We identify groups of neurons with unique morphologies that integrate information across strata and that create neural structures that cage the strata within the nerve ring. We use high resolution light-sheet microscopy [7,8] coupled with lineage-tracing and cell-tracking algorithms [9,10] to resolve the developmental sequence and reveal principles of cell position, migration and outgrowth that guide stratified neuropil organization. Our results uncover conserved structural design principles that underlie the architecture and function of the nerve ring neuropil, and reveal a temporal progression of outgrowth, based on pioneer neurons, that guides the hierarchical development of the layered neuropil. Our findings provide a systematic blueprint for using structural and developmental approaches to understand neuropil organization within the brain.


Subject(s)
Caenorhabditis elegans/embryology , Caenorhabditis elegans/metabolism , Neuropil/chemistry , Neuropil/metabolism , Algorithms , Animals , Brain/cytology , Brain/embryology , Caenorhabditis elegans/chemistry , Caenorhabditis elegans/cytology , Cell Movement , Diffusion , Interneurons/metabolism , Motor Neurons/metabolism , Neurites/metabolism , Neuropil/cytology , Sensory Receptor Cells/metabolism
5.
Biomed Opt Express ; 11(11): 6197-6210, 2020 Nov 01.
Article in English | MEDLINE | ID: mdl-33282484

ABSTRACT

We developed a hyperspectral imaging tool based on surface-enhanced Raman spectroscopy (SERS) probes to determine the expression level and visualize the distribution of PD-L1 in individual cells. Electron-microscopic analysis of PD-L1 antibody - gold nanorod conjugates demonstrated binding to the cell surface and internalization into endosomal vesicles. Stimulation of cells with IFN-γ or metformin was used to confirm the ability of SERS probes to report treatment-induced changes. The multivariate curve resolution-alternating least squares (MCR-ALS) analysis of spectra provided a greater signal-to-noise ratio than single peak mapping. However, single peak mapping allowed a systematic subtraction of background and the removal of non-specific binding and endocytic SERS signals. The mean or maximum peak height in the cell or the mean peak height in the area of specific PD-L1 positive pixels was used to estimate the PD-L1 expression levels in single cells. The PD-L1 levels were significantly up-regulated by IFN-γ and inhibited by metformin in human lung cancer cells from the A549 cell line. In conclusion, the method of analyzing hyperspectral SERS imaging data together with systematic and comprehensive removal of non-specific signals allows SERS imaging to be a quantitative tool in the detection of the cancer biomarker, PD-L1.

6.
Anal Chim Acta ; 1128: 221-230, 2020 Sep 01.
Article in English | MEDLINE | ID: mdl-32825906

ABSTRACT

Diesel exhaust particles (DEPs) are major constituents of air pollution and associated with numerous oxidative stress-induced human diseases. In vitro toxicity studies are useful for developing a better understanding of species-specific in vivo conditions. Conventional in vitro assessments based on oxidative biomarkers are destructive and inefficient. In this study, Raman spectroscopy, as a non-invasive imaging tool, was used to capture the molecular fingerprints of overall cellular component responses (nucleic acid, lipids, proteins, carbohydrates) to DEP damage and antioxidant protection. We apply a novel data visualization algorithm called PHATE, which preserves both global and local structure, to display the progression of cell damage over DEP exposure time. Meanwhile, a mutual information (MI) estimator was used to identify the most informative Raman peaks associated with cytotoxicity. A health index was defined to quantitatively assess the protective effects of two antioxidants (resveratrol and mesobiliverdin IXα) against DEP-induced cytotoxicity. In addition, a number of machine learning classifiers were applied to successfully discriminate different treatment groups with high accuracy. Correlations between Raman spectra and immunomodulatory cytokine and chemokine levels were evaluated. In conclusion, the combination of label-free, non-disruptive Raman micro-spectroscopy and machine learning analysis is demonstrated as a useful tool in quantitative analysis of oxidative stress-induced cytotoxicity and for effectively assessing various antioxidant treatments, suggesting that this framework can serve as a high-throughput platform for screening various potential antioxidants based on their effectiveness at battling the effects of air pollution on human health.


Subject(s)
Antioxidants , Particulate Matter , Antioxidants/pharmacology , Humans , Machine Learning , Oxidative Stress , Spectrum Analysis, Raman , Vehicle Emissions
7.
Nat Biotechnol ; 38(1): 108, 2020 Jan.
Article in English | MEDLINE | ID: mdl-31896828

ABSTRACT

An amendment to this paper has been published and can be accessed via a link at the top of the paper.

8.
Nat Biotechnol ; 37(12): 1482-1492, 2019 Dec.
Article in English | MEDLINE | ID: mdl-31796933

ABSTRACT

The high-dimensional data created by high-throughput technologies require visualization tools that reveal data structure and patterns in an intuitive form. We present PHATE, a visualization method that captures both local and global nonlinear structure using an information-geometric distance between data points. We compare PHATE to other tools on a variety of artificial and biological datasets, and find that it consistently preserves a range of patterns in data, including continual progressions, branches and clusters, better than other tools. We define a manifold preservation metric, which we call denoised embedding manifold preservation (DEMaP), and show that PHATE produces lower-dimensional embeddings that are quantitatively better denoised as compared to existing visualization methods. An analysis of a newly generated single-cell RNA sequencing dataset on human germ-layer differentiation demonstrates how PHATE reveals unique biological insight into the main developmental branches, including identification of three previously undescribed subpopulations. We also show that PHATE is applicable to a wide variety of data types, including mass cytometry, single-cell RNA sequencing, Hi-C and gut microbiome data.


Subject(s)
Genomics/methods , High-Throughput Screening Assays/methods , Image Processing, Computer-Assisted/methods , Algorithms , Animals , Big Data , Cell Differentiation , Cells, Cultured , Computer Simulation , Databases, Genetic , Gastrointestinal Microbiome , Humans , Mice , Sequence Analysis, RNA , Single-Cell Analysis
9.
Nat Methods ; 16(11): 1139-1145, 2019 Nov.
Article in English | MEDLINE | ID: mdl-31591579

ABSTRACT

It is currently challenging to analyze single-cell data consisting of many cells and samples, and to address variations arising from batch effects and different sample preparations. For this purpose, we present SAUCIE, a deep neural network that combines parallelization and scalability offered by neural networks, with the deep representation of data that can be learned by them to perform many single-cell data analysis tasks. Our regularizations (penalties) render features learned in hidden layers of the neural network interpretable. On large, multi-patient datasets, SAUCIE's various hidden layers contain denoised and batch-corrected data, a low-dimensional visualization and unsupervised clustering, as well as other information that can be used to explore the data. We analyze a 180-sample dataset consisting of 11 million T cells from dengue patients in India, measured with mass cytometry. SAUCIE can batch correct and identify cluster-based signatures of acute dengue infection and create a patient manifold, stratifying immune response to dengue.


Subject(s)
Neural Networks, Computer , Single-Cell Analysis , Cluster Analysis , Dengue/immunology , Humans , T-Lymphocytes/immunology
10.
Proc IEEE Int Conf Big Data ; 2019: 2624-2633, 2019 Dec.
Article in English | MEDLINE | ID: mdl-32747879

ABSTRACT

Big data often has emergent structure that exists at multiple levels of abstraction, which are useful for characterizing complex interactions and dynamics of the observations. Here, we consider multiple levels of abstraction via a multiresolution geometry of data points at different granularities. To construct this geometry we define a time-inhomogeneous diffusion process that effectively condenses data points together to uncover nested groupings at larger and larger granularities. This inhomogeneous process creates a deep cascade of intrinsic low pass filters on the data affinity graph that are applied in sequence to gradually eliminate local variability while adjusting the learned data geometry to increasingly coarser resolutions. We provide visualizations to exhibit our method as a "continuously-hierarchical" clustering with directions of eliminated variation highlighted at each step. The utility of our algorithm is demonstrated via neuronal data condensation, where the constructed multiresolution data geometry uncovers the organization, grouping, and connectivity between neurons.
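
The condensation idea described above can be sketched roughly as follows: at each step a diffusion operator is rebuilt from the current point positions and applied once, so the process is time-inhomogeneous and points drift toward their neighbors. This is a minimal sketch under stated assumptions, not the authors' implementation: the Gaussian kernel, the fixed bandwidth `epsilon`, and the function name `diffusion_condensation` are all choices made here for illustration.

```python
import numpy as np


def diffusion_condensation(X, epsilon, n_steps):
    """Return the list of point configurations over n_steps condensation steps.

    ASSUMPTIONS (not from the abstract): Gaussian kernel, fixed bandwidth.
    """
    X = np.asarray(X, dtype=float).copy()
    history = [X.copy()]
    for _ in range(n_steps):
        # Pairwise squared Euclidean distances between the current positions.
        sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
        K = np.exp(-sq_dists / epsilon)
        # Row-normalize into a diffusion operator; it changes at every step
        # because it is recomputed from the condensed positions.
        P = K / K.sum(axis=1, keepdims=True)
        # One low-pass filtering step: each point moves to a weighted average.
        X = P @ X
        history.append(X.copy())
    return history


# Two well-separated toy clusters (synthetic data for illustration only).
rng = np.random.default_rng(0)
cluster_a = rng.normal(loc=0.0, scale=0.5, size=(10, 2))
cluster_b = rng.normal(loc=10.0, scale=0.5, size=(10, 2))
points = np.vstack([cluster_a, cluster_b])
history = diffusion_condensation(points, epsilon=1.0, n_steps=20)
```

Inspecting `history` step by step shows each cluster shrinking toward a single location while the two clusters stay apart; merging at coarser granularities would require a growing bandwidth, which this sketch leaves out.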

11.
Cell ; 174(3): 716-729.e27, 2018 Jul 26.
Article in English | MEDLINE | ID: mdl-29961576

ABSTRACT

Single-cell RNA sequencing technologies suffer from many sources of technical noise, including under-sampling of mRNA molecules, often termed "dropout," which can severely obscure important gene-gene relationships. To address this, we developed MAGIC (Markov affinity-based graph imputation of cells), a method that shares information across similar cells, via data diffusion, to denoise the cell count matrix and fill in missing transcripts. We validate MAGIC on several biological systems and find it effective at recovering gene-gene relationships and additional structures. Applied to the epithelial-to-mesenchymal transition, MAGIC reveals a phenotypic continuum, with the majority of cells residing in intermediate states that display stem-like signatures, and infers known and previously uncharacterized regulatory interactions, demonstrating that our approach can successfully uncover regulatory relations without perturbations.
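
The data-diffusion step described above can be illustrated with a short sketch: build affinities between cells, row-normalize them into a Markov matrix, raise that matrix to a power, and multiply it with the count matrix so each cell borrows signal from similar cells. The fixed-bandwidth Gaussian kernel, the parameter names `sigma` and `t`, and the function name `diffuse_counts` are assumptions made for this example; the abstract does not specify the kernel or preprocessing used by MAGIC.

```python
import numpy as np


def diffuse_counts(X, sigma, t):
    """Smooth a cells-by-genes matrix with t steps of diffusion over cells.

    ASSUMPTIONS (not from the abstract): Gaussian kernel with fixed bandwidth
    sigma, affinities computed directly on the raw matrix.
    """
    X = np.asarray(X, dtype=float)
    # Pairwise squared distances between cells (rows).
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Row-stochastic Markov matrix: each row sums to 1.
    M = K / K.sum(axis=1, keepdims=True)
    # t-step diffusion, then share expression values across similar cells.
    return np.linalg.matrix_power(M, t) @ X


# Three toy cells, two genes; the zero mimics a dropout in cell 1 (made-up data).
counts = np.array([
    [5.0, 5.0],
    [5.0, 0.0],
    [5.0, 5.0],
])
smoothed = diffuse_counts(counts, sigma=5.0, t=3)
```

After diffusion, the dropout entry `smoothed[1, 1]` is pulled above zero by the neighboring cells, while the first gene, which is constant across cells, is left unchanged because every row of the Markov matrix sums to one.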


Subject(s)
Gene Expression Profiling/methods , Sequence Analysis, RNA/methods , Single-Cell Analysis/methods , Algorithms , Cell Line , Epistasis, Genetic/genetics , Gene Regulatory Networks/genetics , Humans , Markov Chains , MicroRNAs/genetics , RNA, Messenger/genetics , Software
12.
Entropy (Basel) ; 20(8), 2018 Jul 27.
Article in English | MEDLINE | ID: mdl-33265649

ABSTRACT

Recent work has focused on the problem of nonparametric estimation of information divergence functionals between two continuous random variables. Many existing approaches require either restrictive assumptions about the density support set or difficult calculations at the support set boundary which must be known a priori. The mean squared error (MSE) convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets is derived where knowledge of the support boundary, and therefore the boundary correction, is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. Guidelines for the tuning parameter selection and the asymptotic distribution of this estimator are provided. Based on the theory, an empirical estimator of Rényi-α divergence is proposed that greatly outperforms the standard kernel density plug-in estimator in terms of mean squared error, especially in high dimensions. The estimator is shown to be robust to the choice of tuning parameters. We show extensive simulation results that verify the theoretical results of our paper. Finally, we apply the proposed estimator to estimate the bounds on the Bayes error rate of a cell classification problem.
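
For readers unfamiliar with the quantity being estimated, the Rényi-α divergence between densities p and q is (1 / (α - 1)) · log ∫ p(x)^α q(x)^(1-α) dx. The sketch below only evaluates this definition numerically on a grid with known densities; it is not the ensemble estimator proposed in the paper, and the helper names `renyi_divergence` and `gaussian_pdf` are hypothetical. A plug-in estimator in the sense of the abstract would substitute kernel density estimates for the true densities.

```python
import numpy as np


def renyi_divergence(p, q, alpha, dx):
    """Renyi-alpha divergence of two densities sampled on a uniform grid.

    Uses a simple Riemann sum; alpha must be positive and different from 1.
    """
    integral = np.sum(p ** alpha * q ** (1.0 - alpha)) * dx
    return np.log(integral) / (alpha - 1.0)


def gaussian_pdf(x, mu, sigma):
    """Density of a univariate normal distribution."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))


# Two unit-variance Gaussians whose means differ by 1 (illustrative choice).
x = np.linspace(-12.0, 13.0, 20001)
dx = x[1] - x[0]
p = gaussian_pdf(x, 0.0, 1.0)
q = gaussian_pdf(x, 1.0, 1.0)
d_half = renyi_divergence(p, q, alpha=0.5, dx=dx)
```

For two Gaussians with equal variance σ² and means μ₁ and μ₂, the divergence has the closed form α(μ₁ - μ₂)² / (2σ²), so `d_half` can be checked against 0.25 for this choice of parameters.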

13.
Article in English | MEDLINE | ID: mdl-27453693

ABSTRACT

High frequency oscillations (HFOs) are a promising biomarker of epileptic brain tissue and activity. HFOs additionally serve as a prototypical example of challenges in the analysis of discrete events in high-temporal-resolution, intracranial EEG data. Two primary challenges are 1) dimensionality reduction, and 2) assessing feasibility of classification. Dimensionality reduction assumes that the data lie on a manifold with dimension less than that of the feature space. However, previous HFO analyses have assumed a linear manifold, global across time, space (i.e., recording electrode/channel), and individual patients. Instead, we assess both a) whether linear methods are appropriate and b) the consistency of the manifold across time, space, and patients. We also estimate bounds on the Bayes classification error to quantify the distinction between two classes of HFOs (those occurring during seizures and those occurring due to other processes). This analysis provides the foundation for future clinical use of HFO features and guides the analysis for other discrete events, such as individual action potentials or multi-unit activity.
